InforLorV4, Main, Exploration, bibRecord, 007B68

Creating word-level language models for large-vocabulary handwriting recognition

Internal identifier: 007B68 (Main/Exploration); previous: 007B67; next: 007B69

Authors: John F. Pitrelli [United States]; Amit Roy [United States]

Source:

International journal on document analysis and recognition: (Print) [1433-2833]; 2003.

RBID : Pascal:03-0386044

French descriptors

Pascal (Inist)
- Reconnaissance écriture, Reconnaissance forme, Reconnaissance langage, Reconnaissance caractère, Analyse syntaxique, Unigram, Tokenization, Word-level language model, Jeton.

English descriptors

KwdEn:
- Character recognition, Handwriting recognition, Language recognition, Pattern recognition, Syntactic analysis, Token.

Abstract

We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.
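
As a rough illustration of the two steps named in the abstract (tokenizing a corpus into words, then keeping a subset of them), the following minimal Python sketch builds a word-unigram model using the plain N-most-frequent-words approach. This is the baseline approach the abstract says the authors deviate from, not their actual method. The tokenizer regular expression, the function names `tokenize`, `build_unigram_model` and `log_prob`, the `floor` value for out-of-vocabulary words, and the toy corpus are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the paper's rules):
# lowercase alphabetic words, allowing internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_unigram_model(corpus, vocab_size):
    """Keep the vocab_size most frequent words and return their
    relative frequencies, renormalized over the kept words."""
    counts = Counter(tokenize(corpus))
    top = counts.most_common(vocab_size)
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top}


def log_prob(model, word, floor=1e-9):
    """Log-probability of a word; out-of-vocabulary words receive
    an arbitrary small floor probability (an assumption)."""
    return math.log(model.get(word, floor))


# Toy corpus, for illustration only.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_unigram_model(corpus, vocab_size=3)
```

In a recognizer, a score such as `log_prob(model, candidate)` could be combined with the recognizer's own score for each candidate word, so that frequent in-vocabulary words are preferred over rare or unknown strings; how the two scores are weighted is not stated in this record.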

Affiliations:

United States

Links to previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000772
to stream PascalFrancis, to step Curation: 000271
to stream PascalFrancis, to step Checkpoint: 000715
to stream Main, to step Merge: 007F74
to stream Main, to step Curation: 007B68

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">03-0386044</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0386044 INIST</idno>
<idno type="RBID">Pascal:03-0386044</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000772</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000271</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000715</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000715</idno>
<idno type="wicri:doubleKey">1433-2833:2003:Pitrelli J:creating:word:level</idno>
<idno type="wicri:Area/Main/Merge">007F74</idno>
<idno type="wicri:Area/Main/Curation">007B68</idno>
<idno type="wicri:Area/Main/Exploration">007B68</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Handwriting recognition</term>
<term>Language recognition</term>
<term>Pattern recognition</term>
<term>Syntactic analysis</term>
<term>Token</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance écriture</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance langage</term>
<term>Reconnaissance caractère</term>
<term>Analyse syntaxique</term>
<term>Unigram</term>
<term>Tokenization</term>
<term>Word-level language model</term>
<term>Jeton</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
</list>
<tree><country name="États-Unis"><noRegion><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
</noRegion>
<name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 007B68 | SxmlIndent | more

Or, equivalently, if $EXPLOR_AREA already points to the InforLorV4 exploration area:

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 007B68 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:03-0386044
   |texte=   Creating word-level language models for large-vocabulary handwriting recognition
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022

	Exploration server on computer science research in Lorraine
	Warning: this site is under development! Warning: this site was generated automatically from raw corpora. The information has therefore not been validated.
